Day 06 - Unit 1: Descriptive Statistics (1)

2017 iT 邦幫忙 Ironman Contest

Big Data

Data Science with Clojure, part 6 of the series

haroldwu

2016-12-21 23:58:00

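Example code

All example code is available from the GitHub repo; the code for the first chapter is under Chapter 01. The sample data comes from the Complex Systems Research Group at the Medical University of Vienna. Download it with the script/download-data.sh script included in the chapter 1 repo.

Inspecting the data

Once the data is downloaded, we can start trying out basic operations with Incanter. Incanter is a Clojure library that tries to bring R's data.frame concept and operations, along with the tools built on top of data.frame, to Clojure.

To use Incanter in a project, add the following to project.clj: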

:dependencies [[incanter/incanter-core "1.5.5"]
               [incanter/incanter-excel "1.5.5"] ; needed for incanter.excel below
               [incanter/incanter-stats "1.5.5"]]

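Reading the data

Incanter is split into many modules; require only the ones you need. Let's read the data:

For CSV files, use incanter-io
For Excel files, use incanter-excel
For less common formats, write your own parsing function and use the dataset function to turn the result into something Incanter can read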

(ns cljds.ch1.data
  (:require [clojure.java.io :as io]
            [incanter.core :as i]
            [incanter.excel :as xls]))

(defmulti load-data identity)
(defmethod load-data :uk [_]
  (-> (io/resource "UK2010.xls")
      (str)
      (xls/read-xls)))

Here a multimethod gives us a convenient expression for fetching the UK2010 data: (load-data :uk). Next we use col-names from incanter.core, which returns the names of the columns in the dataset.

(defn ex-1-1 []
  (i/col-names (load-data :uk)))

Functions whose names start with ex- can be run directly through lein, so running lein run -e 1.1 shows the output.
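The dispatch pattern behind load-data can be tried out without any data files. The following is a minimal sketch in plain Clojure: load-sample, its keys, and the rows are made up for illustration and are not part of the example repo.

```clojure
;; Dispatch on the argument itself, the same way load-data uses identity.
(defmulti load-sample identity)

;; Each keyword selects one method. The rows here are hypothetical values.
(defmethod load-sample :tiny [_]
  [[2010.0 100] [2010.0 200]])

(defmethod load-sample :empty [_]
  [])

;; :default handles unknown keys with a clearer error than the built-in
;; "No method in multimethod" exception.
(defmethod load-sample :default [k]
  (throw (ex-info "Unknown dataset key" {:key k})))

(count (load-sample :tiny))  ;=> 2
(count (load-sample :empty)) ;=> 0
```

Adding a new data source later only needs another defmethod, which is how a cleaned variant of the data can be registered under its own key.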

Data science tip: before processing, understand the meaning and data type of every column

The contents of a column

(defn ex-1-2 []
  (i/$ "Election Year" (load-data :uk)))
; (2010.0 2010.0 2010.0 2010.0 2010.0 ... 2010.0 2010.0 nil)

(defn ex-1-3 []
  (->> (load-data :uk)
       (i/$ "Election Year")
       (distinct)))
; (2010.0 nil)

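The first function returns every value in the column. The second uses distinct to collapse them to the unique values. This reveals that some rows in the data have an empty (nil) Election Year. How many such rows are there?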

(defn ex-1-4 []
  (->> (load-data :uk)
       (i/$ "Election Year")
       (frequencies)))
; {2010.0 650 nil 1}

frequencies counts how many times each value occurs, so it tells us how many nil entries there are: just one.
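Both distinct and frequencies are ordinary clojure.core functions, so their behaviour can be checked on a small vector first. The values below are a made-up stand-in for the "Election Year" column, not the real data.

```clojure
;; Hypothetical stand-in for a column with one missing value.
(def years [2010.0 2010.0 2010.0 nil])

;; distinct keeps the first occurrence of each value, in order.
(distinct years)               ;=> (2010.0 nil)

;; frequencies returns a map from each value to its number of occurrences.
(frequencies years)            ;=> {2010.0 3, nil 1}

;; nil is a valid map key, so the missing-value count can be looked up directly.
(get (frequencies years) nil)  ;=> 1

;; An equivalent count using a predicate.
(count (filter nil? years))    ;=> 1
```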

Data scrubbing

Data science tip: it is often said that 80% of a data scientist's work is data cleaning

(-> (load-data :uk)
    (i/query-dataset {"Election Year" {:$eq nil}}))

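query-dataset returns the rows that match the query map, and takes the dataset as its first argument. To filter rows the way SQL's WHERE does while threading with ->>, use i/$where, which takes the same query map but with the arguments reversed (query first, dataset last). The query map supports the following operators: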

Clojure	Math	English
`:$gt`	>	greater than
`:$lt`	<	less than
`:$gte`	>=	greater than or equal to
`:$lte`	<=	less than or equal to
`:$eq`	==	equal to
`:$ne`	!=	not equal to
`:$in`		positive membership
`:$nin`		negative membership
`:$fn`		a self defined comparison function
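To make the operator table concrete, here is a rough sketch of how such query maps could be evaluated over rows stored as plain Clojure maps. This is a hand-rolled illustration, not Incanter's implementation: ops, matches?, and the sample rows are all assumptions introduced here.

```clojure
;; Assumption: an illustrative mapping from operator keywords to predicates.
(def ops
  {:$gt  >
   :$lt  <
   :$gte >=
   :$lte <=
   :$eq  =
   :$ne  not=
   :$in  (fn [v coll] (contains? (set coll) v))
   :$nin (fn [v coll] (not (contains? (set coll) v)))
   :$fn  (fn [v f] (boolean (f v)))})

(defn matches?
  "True when every condition in query holds for row (a map of column -> value)."
  [query row]
  (every? (fn [[col conds]]
            (every? (fn [[op arg]] ((ops op) (get row col) arg)) conds))
          query))

;; Hypothetical rows; the last one mimics an empty spreadsheet row.
(def rows
  [{"Election Year" 2010.0 "Electorate" 60000}
   {"Election Year" 2010.0 "Electorate" 80000}
   {"Election Year" nil    "Electorate" nil}])

;; Rows whose year is missing.
(filter (partial matches? {"Election Year" {:$eq nil}}) rows)

;; Drop the empty row first, since numeric comparisons fail on nil.
(def clean (filter (partial matches? {"Election Year" {:$ne nil}}) rows))
(filter (partial matches? {"Electorate" {:$gt 70000}}) clean)
```

Filtering out nil before applying the numeric operators mirrors the order used in this article: find the empty row, remove it, then compute on what remains.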

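Next, let's convert the problematic row to a map to learn more about it. to-map turns the data into a map with the column names as keys and the actual values as values. If there is more than one row, the values for each key are collected into a sequence.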

(defn ex-1-5 []
  (->> (load-data :uk)
       (i/$where {"Election Year" {:$eq nil}})
       (i/to-map)))
; {:ILEU nil, :TUSC nil, :Vote nil ... :IVH nil, :FFR nil}

It turns out the entire row is empty, so we can drop it. The following code produces the cleaned data:

(->> (load-data :uk)
     (i/$where {"Election Year" {:$ne nil}}))

After checking, the nil row is gone. We can therefore add a new load-data method, so that the program hands us the clean data:

(defmethod load-data :uk-scrubbed [_]
  (->> (load-data :uk)
       (i/$where {"Election Year" {:$ne nil}})))
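The same inspect-then-drop idea can be sketched in plain Clojure, which may help when the data is not in an Incanter dataset. The column names and values below are invented for illustration; row->map and blank-row? are hypothetical helpers, not Incanter functions.

```clojure
;; Hypothetical column names and rows; the last row is entirely empty.
(def col-names ["Election Year" "Electorate" "Votes"])

(def raw-rows
  [[2010.0 60000 40000]
   [2010.0 80000 50000]
   [nil    nil   nil]])

(defn row->map
  "Pairs each column name with the value in the same position of the row."
  [row]
  (zipmap col-names row))

(defn blank-row?
  "True when every value in the row map is nil."
  [m]
  (every? nil? (vals m)))

;; Inspect the suspicious row as a map, then drop all-nil rows.
(row->map (last raw-rows))
;=> {"Election Year" nil, "Electorate" nil, "Votes" nil}

(def scrubbed (remove blank-row? (map row->map raw-rows)))
(count scrubbed) ;=> 2
```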

Descriptive statistics

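Descriptive statistics make no guesses about what the data means (in contrast to inferential statistics). Their main purpose is to present the distributional properties of the data.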

(defn ex-1-6 []
  (->> (load-data :uk-scrubbed)
       (i/$ "Electorate")
       (count)))

count is the most basic counting function: it returns the number of values in the column.

Mathematical notation and Clojure

Name	Math Symbol	Clojure
Sigma	Σ	`(reduce + xs)`
Pi	Π	`(reduce * xs)`

So, taking the mean as an example, the function looks like this:

(defn mean [xs]
  (/ (reduce + xs)
     (count xs)))

(defn ex-1-7 []
  (->> (load-data :uk-scrubbed)
       (i/$ "Electorate")
       (mean)))
; 70149.94
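A few small inputs show how the Σ and Π translations and the hand-written mean behave. This is a sketch in plain Clojure; safe-mean is a hypothetical helper added here to illustrate two edge cases (integer inputs and empty input), not something from the original code.

```clojure
;; Sigma and Pi as reductions.
(reduce + [1 2 3 4]) ;=> 10
(reduce * [1 2 3 4]) ;=> 24

;; The same definition of mean as above.
(defn mean [xs]
  (/ (reduce + xs)
     (count xs)))

;; With integer inputs, / returns an exact ratio rather than a decimal.
(mean [1 2])          ;=> 3/2
(mean [1.0 2.0])      ;=> 1.5
(double (mean [1 2])) ;=> 1.5

;; (mean []) evaluates (/ 0 0) and throws ArithmeticException.
;; Hypothetical variant: returns nil for empty input and always a double.
(defn safe-mean [xs]
  (when (seq xs)
    (double (/ (reduce + xs) (count xs)))))

(safe-mean [])      ;=> nil
(safe-mean [1 2 3]) ;=> 2.0
```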

However, Incanter already provides a mean function that can be called directly. It lives in the incanter.stats namespace. Since we will need it from here on, add it to our ns.